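Section 1: ROC curves (1)

The ROC curve is a statistical method commonly used in medical research to compare how well different biomarkers screen for (or predict) a disease.
Compared with the odds ratios obtained from logistic regression, an ROC curve gives a visual presentation, so the audience can see at a glance which marker predicts best.

– First download this week's example data (the file Example data.csv read below)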

dat = read.csv("Example data.csv", header = TRUE)
head(dat)

##       eGFR Disease Survival.time Death Diabetes Cancer      SBP      DBP
## 1 34.65379       1     0.4771037     0        0      1 121.2353 121.3079
## 2 37.21183       1     3.0704424     0        1      1 122.2000 122.6283
## 3 32.60074       1     0.2607117     1        0      0 118.9136 121.7621
## 4 29.68481       1            NA    NA        0      0 118.2212 112.7043
## 5 28.35726       0     0.1681673     1        0      0 116.7469 115.7705
## 6 33.95012       1     1.2238556     0        0      0 119.9936 116.3872
##   Education Income
## 1         2      0
## 2         2      0
## 3         0      0
## 4         1      0
## 5         0      0
## 6         1      0

pROC is one of the packages most commonly used to draw ROC curves; please install and load it

library(pROC)

Section 1: ROC curves (2)

We start by drawing the ROC curve of eGFR for Disease…

ROC1 = roc(dat[,"Disease"], dat[,"eGFR"])
plot(ROC1, col = "red")
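
To make concrete what roc() computes, the sketch below builds the points of an ROC curve by hand in base R. The data are simulated, the names marker and disease are hypothetical, and it assumes that larger marker values indicate disease; pROC picks the direction automatically, so its thresholds may differ.

```r
# Simulated data (hypothetical): 50 controls followed by 50 cases
set.seed(1)
disease <- rep(c(0, 1), each = 50)
marker  <- rnorm(100, mean = disease)   # cases are shifted upwards by 1

# Candidate cut-offs: every observed value, plus -Inf so the curve
# starts at sensitivity 1 and specificity 0
thresholds <- c(-Inf, sort(unique(marker)))

# A subject is called positive when marker >= threshold
sens <- sapply(thresholds, function(t) mean(marker[disease == 1] >= t))
spec <- sapply(thresholds, function(t) mean(marker[disease == 0] <  t))

head(data.frame(threshold = thresholds, sensitivity = sens, specificity = spec))
```

Each row is one point of the curve: raising the threshold trades sensitivity for specificity.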

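Once the curve is drawn, the next question is which point to use as the cut-off (usually the point where sensitivity + specificity is largest)

– The function which.max() returns the position of the largest value in a vector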

pos = which.max(ROC1$sensitivities + ROC1$specificities)
pos

## [1] 407

As we can see, the 407th cut-off has the largest sum of sensitivity and specificity

– Use indexing to see what that cut-off is

cut = ROC1$thresholds[pos]
cut

## [1] 33.40866
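
Note that which.max() returns the position of the maximum, not the maximum itself; that position is then used to index the thresholds. A small sketch with made-up sensitivities and specificities (all numbers below are hypothetical):

```r
# Hypothetical ROC points, one per threshold
thresholds <- c(-Inf, 25, 30, 35, Inf)
sens       <- c(1.00, 0.90, 0.75, 0.40, 0.00)
spec       <- c(0.00, 0.35, 0.70, 0.90, 1.00)

total <- sens + spec          # 1.00 1.25 1.45 1.30 1.00
pos   <- which.max(total)     # position of the largest sum: 3
best  <- thresholds[pos]      # the cut-off at that position: 30

c(position = pos, cutoff = best)
```

If several positions share the maximum, which.max() returns the first one.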

Section 1: ROC curves (2, continued)

Now we use the techniques from the last class to annotate our figure

plot(ROC1, col = "red")
points(ROC1$specificities[pos], ROC1$sensitivities[pos], pch = 19, cex = 2)

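# digits must be passed by name: the second positional argument of formatC() is width
description = paste0("cut-off point: ", formatC(cut, digits = 2, format = "f"),
                     " (Sens = ", formatC(ROC1$sensitivities[pos], digits = 3, format = "f"),
                     "; Spec = ", formatC(ROC1$specificities[pos], digits = 3, format = "f"), ")")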

text(ROC1$specificities[pos], ROC1$sensitivities[pos], description, pos = 1)
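
The label can also be built with sprintf(), which keeps the number of decimals next to each value. A minimal sketch; the sensitivity and specificity values here are hypothetical placeholders, not results from the example data:

```r
cut_value <- 33.40866   # the cut-off reported earlier in this section
sens_val  <- 0.512      # hypothetical value, for illustration only
spec_val  <- 0.634      # hypothetical value, for illustration only

label <- sprintf("cut-off point: %.2f (Sens = %.3f; Spec = %.3f)",
                 cut_value, sens_val, spec_val)
label

# formatC() gives the same rounding when digits is passed by name
formatC(cut_value, digits = 2, format = "f")
```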

Section 1: ROC curves (3)

When drawing ROC curves, we often overlay several of them on one figure, which makes comparison easier.

– In R, we can do it like this…

ROC1 = roc(dat[,"Disease"], dat[,"eGFR"])
ROC2 = roc(dat[,"Disease"], dat[,"SBP"])
ROC3 = roc(dat[,"Disease"], dat[,"DBP"])
plot(ROC1, col = "red")
plot(ROC2, col = "darkgreen", add = TRUE)
plot(ROC3, col = "blue", add = TRUE)
legend("bottomright", c("eGFR", "SBP", "DBP"), lty = 1, lwd = 2, col = c("red", "darkgreen", "blue"))

After drawing the figure, we can see that SBP has the largest area under the curve (AUC), but is the difference significant? We can test this with the function roc.test():

roc.test(ROC2, ROC1)

## 
##  DeLong's test for two correlated ROC curves
## 
## data:  ROC2 and ROC1
## Z = 2.6874, p-value = 0.007202
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.5664361   0.5504805

roc.test(ROC2, ROC3)

## 
##  DeLong's test for two correlated ROC curves
## 
## data:  ROC2 and ROC3
## Z = 2.2119, p-value = 0.02698
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.5640136   0.5372651

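Although the differences look small in the figure, they are statistically significant! (In the output, "AUC of roc1" and "AUC of roc2" refer to the first and second arguments passed to roc.test(), not to the objects named ROC1 and ROC2.)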
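
The AUC being compared here has a direct interpretation: it is the probability that a randomly chosen case has a higher marker value than a randomly chosen control, with ties counting one half. The base-R sketch below computes it that way on made-up numbers; auc_by_pairs is a hypothetical helper, not a pROC function, and it assumes that higher values indicate disease.

```r
auc_by_pairs <- function(marker, disease) {
  cases    <- marker[disease == 1]
  controls <- marker[disease == 0]
  # Compare every case with every control:
  # 1 if the case is higher, 0.5 if tied, 0 otherwise
  wins <- outer(cases, controls, function(a, b) (a > b) + 0.5 * (a == b))
  mean(wins)
}

disease <- c(0, 1, 0, 1)
marker  <- c(1, 2, 3, 4)
auc_by_pairs(marker, disease)   # 3 of the 4 case-control pairs are ordered correctly: 0.75
```

An AUC of 0.5 means the marker orders cases and controls no better than chance, which is why values such as 0.55 to 0.57 indicate weak discrimination even when the difference between curves is significant.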

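Section 1: ROC curves (4)

Besides comparing how well different continuous variables predict Disease, ROC curves are also often used to compare the predictive ability of different logistic regression models.

– A model fitted with glm() also provides predicted probabilities that estimate each subject's chance of being positive, and the accuracy of these predicted probabilities can be assessed with an ROC curve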

model1 = glm(dat[,"Disease"]~dat[,"SBP"], family = "binomial")
model2 = glm(dat[,"Disease"]~dat[,"SBP"] + factor(dat[,"Diabetes"]), family = "binomial")
model3 = glm(dat[,"Disease"]~dat[,"SBP"] + factor(dat[,"Diabetes"]) + factor(dat[,"Education"]) + factor(dat[,"Income"]), family = "binomial")

ROC1 = roc(model1$y, model1$fitted.values)
ROC2 = roc(model2$y, model2$fitted.values)
ROC3 = roc(model3$y, model3$fitted.values)
plot(ROC1, col = "red")
plot(ROC2, col = "darkgreen", add = TRUE)
plot(ROC3, col = "blue", add = TRUE)
legend("bottomright", c("Model 1", "Model 2", "Model 3"), lty = 1, lwd = 2, col = c("red", "darkgreen", "blue"))

This result tells us that using more variables does not necessarily give better predictive ability!
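
The ROC curves above are drawn from fitted.values, that is, on the same data used to fit each model. A hedged sketch of comparing models on held-out data instead is given below. The data are simulated, the variable Noise and the helper auc_by_pairs are hypothetical, and the formula with data = style is an alternative to the dat[,"..."] style used above.

```r
set.seed(42)
n <- 500
sim <- data.frame(SBP      = rnorm(n, 120, 10),
                  Diabetes = rbinom(n, 1, 0.3),
                  Noise    = rnorm(n))   # unrelated to the outcome by construction
sim$Disease <- rbinom(n, 1, plogis(-6 + 0.05 * sim$SBP + 0.8 * sim$Diabetes))

train <- sim[1:350, ]
test  <- sim[351:500, ]

m_small <- glm(Disease ~ SBP + factor(Diabetes), data = train, family = "binomial")
m_large <- glm(Disease ~ SBP + factor(Diabetes) + Noise, data = train, family = "binomial")

# Rank-based AUC: share of case-control pairs ordered correctly (ties count 0.5)
auc_by_pairs <- function(p, y) {
  mean(outer(p[y == 1], p[y == 0], function(a, b) (a > b) + 0.5 * (a == b)))
}

# type = "response" returns predicted probabilities rather than log-odds
p_small <- predict(m_small, newdata = test, type = "response")
p_large <- predict(m_large, newdata = test, type = "response")

c(small = auc_by_pairs(p_small, test$Disease),
  large = auc_by_pairs(p_large, test$Disease))
```

With real data, the same predicted probabilities could be passed to roc() in place of fitted.values.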

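Exercise 1: Drawing an ROC curve by hand

Although plot() once again draws the ROC curve for us directly, for later applications you still need to draw the ROC curves yourself from the sensitivities and specificities stored in the objects ROC1, ROC2 and ROC3

– Follow the steps below to draw the ROC curves

Open an empty canvas
Use the function lines() to draw the 3 ROC curves
Draw the axes yourself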

Exercise 1: Answer

The key to this exercise is understanding how an ROC curve is built:

model1 = glm(dat[,"Disease"]~dat[,"SBP"], family = "binomial")
model2 = glm(dat[,"Disease"]~dat[,"SBP"] + factor(dat[,"Diabetes"]), family = "binomial")
model3 = glm(dat[,"Disease"]~dat[,"SBP"] + factor(dat[,"Diabetes"]) + factor(dat[,"Education"]) + factor(dat[,"Income"]), family = "binomial")

ROC1 = roc(model1$y, model1$fitted.values)
ROC2 = roc(model2$y, model2$fitted.values)
ROC3 = roc(model3$y, model3$fitted.values)

plot.new()
# Reverse the x-axis (1 -> 0) so the curves have the same orientation as plot() on a roc object
plot.window(xlim = c(1, 0), ylim = c(0, 1))

lines(ROC1$specificities, ROC1$sensitivities, col = "red")
lines(ROC2$specificities, ROC2$sensitivities, col = "darkgreen")
lines(ROC3$specificities, ROC3$sensitivities, col = "blue")

axis(1, at = c(0, 0.2, 0.4, 0.6, 0.8, 1), labels = c("0%", "20%", "40%", "60%", "80%", "100%"), pos = 0)
axis(2, at = c(0, 0.2, 0.4, 0.6, 0.8, 1), labels = c("0%", "20%", "40%", "60%", "80%", "100%"), pos = 1, las = 2)

# mtext() takes cex, not cex.lab
mtext("Specificity", side = 1, line = 3, cex = 1, col = "black")
mtext("Sensitivity", side = 2, line = 3, cex = 1, col = "black")

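Section 2: Color transparency and functions (1)

Remember the scatter plot of SBP against DBP from the last class? Did you notice that many of the points overlap?

– This problem comes up often with large datasets; in that case we may need to show the reader how dense the points are in different regions.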

plot(dat[,"SBP"], dat[,"DBP"], ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP", cex = 2)

plot(dat[,"SBP"], dat[,"DBP"], ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP", pch = 19, cex = 2)

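Section 2: Color transparency and functions (2)

In R, colors are written as 6- or 8-digit hexadecimal codes in the form #[red][green][blue][alpha], where the optional last pair is the opacity (00 is fully transparent, FF is fully opaque)

– For example, opaque red is "#FF0000" or "#FF0000FF"

– Red at 50% transparency is "#FF000080"

– Black at 50% transparency is "#00000080"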

x = c(1, 0, -1, 0)
y = c(0, 1, 0, -1)
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
points(x, y, pch = 19, cex = 2, col = c("#FF0000", "#FF0000FF", "#FF000080", "#00000080"))

If you would rather not work out the color code yourself, the function rgb() can mix the color for you

rgb(1, 0, 0, 0.5)

## [1] "#FF000080"

rgb(0.7, 0.5, 0.3, 0.7)

## [1] "#B3804DB3"
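
To check how the alpha pair maps to the numbers passed to rgb(), col2rgb() can decode a color code back into its channels, and adjustcolor() adds transparency to a named color. A short sketch using only base R (grDevices):

```r
# Decode an 8-digit code: the alpha row is on the 0-255 scale (128 is about 50%)
col2rgb("#FF000080", alpha = TRUE)

# rgb() takes channels on the 0-1 scale by default
rgb(1, 0, 0, 0.5)                    # "#FF000080"

# Make a named color half transparent
adjustcolor("red", alpha.f = 0.5)    # "#FF000080"
```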

With a semi-transparent color, the scatter plot from before finally shows the density

plot(dat[,"SBP"], dat[,"DBP"], ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP", pch = 19, cex = 2, col = "#00000030")

Section 2: Color transparency and functions (3)

In fact, the function smoothScatter() can draw a scatter plot similar to the one above:

smoothScatter(dat[,"SBP"], dat[,"DBP"], nrpoints = 0, ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP")

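We can also add an annotation (a color legend) to it. This is harder, but we can search Google to see whether a solution exists

[Figure F9_1]

It looks like there is a solution, but it requires installing the package fields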

library(fields)

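# Hook run by smoothScatter() after plotting: it reads the grid and density
# from smoothScatter()'s own frame and adds a color legend with image.plot()
fudgeit <- function(){
  xm <- get('xm', envir = parent.frame(1))
  ym <- get('ym', envir = parent.frame(1))
  z  <- get('dens', envir = parent.frame(1))
  colramp <- get('colramp', envir = parent.frame(1))
  image.plot(xm, ym, z, col = colramp(256), legend.only = TRUE, add = FALSE)
}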

par(mar = c(5,4,4,5))
smoothScatter(dat[,"SBP"], dat[,"DBP"], nrpoints = 0, ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP", postPlotHook = fudgeit)
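
The hook works because postPlotHook is called from inside smoothScatter(), so parent.frame(1) is smoothScatter()'s own environment and get() can read its local variables. A toy sketch of the same mechanism follows; draw_something and peek are hypothetical names, and relying on another function's internal variable names is fragile if those names ever change.

```r
draw_something <- function(hook = NULL) {
  dens <- c(0.1, 0.5, 0.9)     # a local variable, not visible from outside
  if (!is.null(hook)) hook()   # the hook is called from inside this function
}

peek <- function() {
  # parent.frame(1) is the environment of the function that called peek()
  get("dens", envir = parent.frame(1))
}

draw_something(hook = peek)
```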

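Exercise 2: Adapting someone else's code

Having felt the power of Google, you should now know that if you want a good-looking figure, asking Google is the fastest way.

– Now suppose you are still not satisfied with a single-hue scatter plot and want to improve it further. Google points the way: see "R Scatter Plot: symbol color represents number of overlapping points"

[Figure F9_2]

How do we apply the code from that web page to our own figure?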

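Exercise 2: Answer

Treat the other person's code as a function: you only need to change the inputs:

x1 <- dat[,"SBP"] # the key change is here
x2 <- dat[,"DBP"] # the key change is here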
df <- data.frame(x1,x2)

## Use densCols() output to get density at each point
x <- densCols(x1,x2, colramp=colorRampPalette(c("black", "white")))
df$dens <- col2rgb(x)[1,] + 1L

## Map densities to colors
cols <-  colorRampPalette(c("#000099", "#00FEFF", "#45FE4F", 
                            "#FCFF00", "#FF9400", "#FF3100"))(256)
df$col <- cols[df$dens]

## Plot it, reordering rows so that densest points are plotted on top
plot(x2~x1, data=df[order(df$dens),], pch = 19, col = col, cex = 2, ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP") # the key change is here
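
Taking "treat it as a function" literally, the same steps can be wrapped in a helper so that only the inputs change. This is a sketch under assumptions: density_colors is a hypothetical name, the data are simulated stand-ins for SBP and DBP, and densCols() relies on the KernSmooth package, which normally ships with R.

```r
density_colors <- function(x, y,
                           ramp = c("#000099", "#00FEFF", "#45FE4F",
                                    "#FCFF00", "#FF9400", "#FF3100")) {
  # densCols() encodes local density as a grey level; its red channel is 0-255
  grey <- densCols(x, y, colramp = colorRampPalette(c("black", "white")))
  dens <- col2rgb(grey)[1, ] + 1L
  # Map the 1-256 density index onto the chosen color ramp
  list(dens = dens, col = colorRampPalette(ramp)(256)[dens])
}

set.seed(7)
sbp <- rnorm(2000, 120, 10)             # simulated, not the course data
dbp <- 0.6 * sbp + rnorm(2000, 0, 8)
res <- density_colors(sbp, dbp)

ord <- order(res$dens)                  # densest points are drawn last, on top
plot(sbp[ord], dbp[ord], pch = 19, col = res$col[ord], xlab = "SBP", ylab = "DBP")
```

With the course data, the call would be density_colors(dat[,"SBP"], dat[,"DBP"]), provided neither column contains missing values.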

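Section 3: 3D graphics (1)

The R community provides several packages for drawing 3D graphics; we introduce a few of the more commonly used ones

– If you really want a good-looking figure, searching Google and then modifying existing code is the fastest route

Of all 3D figures, the 3D scatter plot is the one that appears most often in medical journals, and it is frequently used in cluster analysis.

[Figure F9_3]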

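Section 3: 3D graphics (2)

Here we first use the built-in iris dataset, which records the petal and sepal length and width of 3 species of flowers (50 samples per species).

– To load a built-in dataset in R, use the function data(). (Many packages start their Examples section by calling data() to load a built-in dataset)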

data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

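Our hypothesis is that the petal and sepal measurements should differ noticeably between the 3 species, while the 50 samples of the same species should be quite close to each other. Let's check whether this is right

– First we look at 2D scatter plots…

COLOR = as.integer(iris$Species)+1 # color by Species; in R, color code 2 is red, 3 is green and 4 is blue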

par(mfrow = c(2, 3))
plot(iris[,"Sepal.Length"], iris[,"Sepal.Width"], pch = 19, col = COLOR)
legend("topright", levels(iris$Species), pch = 19, col = 2:4, bg = "gray90")
plot(iris[,"Sepal.Length"], iris[,"Petal.Length"], pch = 19, col = COLOR)
legend("bottomright", levels(iris$Species), pch = 19, col = 2:4, bg = "gray90")
plot(iris[,"Sepal.Length"], iris[,"Petal.Width"], pch = 19, col = COLOR)
legend("bottomright", levels(iris$Species), pch = 19, col = 2:4, bg = "gray90")
plot(iris[,"Sepal.Width"], iris[,"Petal.Length"], pch = 19, col = COLOR)
legend("topright", levels(iris$Species), pch = 19, col = 2:4, bg = "gray90")
plot(iris[,"Sepal.Width"], iris[,"Petal.Width"], pch = 19, col = COLOR)
legend("topright", levels(iris$Species), pch = 19, col = 2:4, bg = "gray90")
plot(iris[,"Petal.Length"], iris[,"Petal.Width"], pch = 19, col = COLOR)
legend("bottomright", levels(iris$Species), pch = 19, col = 2:4, bg = "gray90")

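Although the species can just about be separated, versicolor and virginica are clearly closer to each other. Now we want to show this grouping in a 3D scatter plot.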
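
The six plot() calls above follow one pattern, so they can also be generated in a loop. A sketch with combn(), which lists every pair of the four measurement columns; the legends are omitted for brevity:

```r
data(iris)
COLOR <- as.integer(iris$Species) + 1
vars  <- names(iris)[1:4]
pairs_of_vars <- combn(vars, 2)      # 2 x 6 matrix, one column per pair

old_par <- par(mfrow = c(2, 3))      # 2 rows x 3 columns of panels
for (k in seq_len(ncol(pairs_of_vars))) {
  xv <- pairs_of_vars[1, k]
  yv <- pairs_of_vars[2, k]
  plot(iris[, xv], iris[, yv], xlab = xv, ylab = yv, pch = 19, col = COLOR)
}
par(old_par)                         # restore the previous layout
```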

Section 3: 3D graphics (3)

Here we need the package scatterplot3d.

library(scatterplot3d)

Next, we choose 3 variables to draw the scatter plot

scatterplot3d(x = iris[,"Sepal.Length"],
              y = iris[,"Sepal.Width"],
              z = iris[,"Petal.Length"],
              color = COLOR, pch = 19, angle = 40, main="3D Scatterplot")

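The groups do seem to be separated better, but the viewing angle is not easy to adjust, so let's try another package

Section 3: 3D graphics (4)

Here we need the package rgl.

– rgl is the package most commonly used to draw 3D graphics in R, and it supports interactive 3D graphics.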

library(rgl)

Next, we choose 3 variables to draw the scatter plot

library(rgl)
plot3d(x = iris[,"Sepal.Length"],
       y = iris[,"Sepal.Width"],
       z = iris[,"Petal.Length"],
       col = COLOR, size = 3, main="3D Scatterplot")

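[Figure F9_4]

Much cooler, isn't it?

Exercise 3: 3D figures

Now we know that good data visualization comes from finding suitable examples on Google. Taking the package rgl as an example, a complete tutorial is available in "A complete guide to 3D visualization device system in R - R software and data visualization"

– Let's try a small example: to have R enclose the points of the previous figure in an ellipsoid, we can use the code below…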

library(rgl)
plot3d(x = iris[,"Sepal.Length"],
       y = iris[,"Sepal.Width"],
       z = iris[,"Petal.Length"],
       col = COLOR, size = 3, main="3D Scatterplot")

VCOV = cov(iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length")])
MEAN = c(mean(iris[,"Sepal.Length"]), mean(iris[,"Sepal.Width"]), mean(iris[,"Petal.Length"]))

ellips = ellipse3d(VCOV, centre = MEAN, level = 0.95)
plot3d(ellips, col = "black", alpha = 0.2, add = TRUE, box = FALSE)

[Figure F9_5]

Now, can you draw 3 separate ellipsoids, one for each species?


Exercise 3: Answer

The point is to turn one ellipsoid into three:

plot3d(x = iris[,"Sepal.Length"],
       y = iris[,"Sepal.Width"],
       z = iris[,"Petal.Length"],
       col = COLOR, size = 3, xlab = "Sepal Length", ylab = "Sepal Width", zlab = "Petal Length")

VCOV = cov(iris[1:50,c("Sepal.Length", "Sepal.Width", "Petal.Length")])
MEAN = apply(iris[1:50,c("Sepal.Length", "Sepal.Width", "Petal.Length")], 2, mean)

ellips <- ellipse3d(VCOV, centre = MEAN, level = 0.95)
plot3d(ellips, col = "red", alpha = 0.2, add = TRUE, box = FALSE)


VCOV = cov(iris[51:100,c("Sepal.Length", "Sepal.Width", "Petal.Length")])
MEAN = apply(iris[51:100,c("Sepal.Length", "Sepal.Width", "Petal.Length")], 2, mean)

ellips <- ellipse3d(VCOV, centre = MEAN, level = 0.95)
plot3d(ellips, col = "green", alpha = 0.2, add = TRUE, box = FALSE)


VCOV = cov(iris[101:150,c("Sepal.Length", "Sepal.Width", "Petal.Length")])
MEAN = apply(iris[101:150,c("Sepal.Length", "Sepal.Width", "Petal.Length")], 2, mean)

ellips <- ellipse3d(VCOV, centre = MEAN, level = 0.95)
plot3d(ellips, col = "blue", alpha = 0.2, add = TRUE, box = FALSE)
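
The three blocks above differ only in the rows and the color, so they can be written as a loop over species without hard-coding row numbers. A sketch: the per-species means and covariances use base R only, and the rgl drawing is guarded so that it runs only in an interactive session with rgl installed.

```r
data(iris)
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length")

# One data frame per species, then its centre and covariance matrix
groups  <- split(iris[, vars], iris$Species)
centres <- lapply(groups, colMeans)
vcovs   <- lapply(groups, cov)

if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  ellipse_cols <- c("red", "green", "blue")   # same order as levels(iris$Species)
  rgl::plot3d(x = iris[, vars[1]], y = iris[, vars[2]], z = iris[, vars[3]],
              col = as.integer(iris$Species) + 1, size = 3,
              xlab = "Sepal Length", ylab = "Sepal Width", zlab = "Petal Length")
  for (i in seq_along(groups)) {
    e <- rgl::ellipse3d(vcovs[[i]], centre = centres[[i]], level = 0.95)
    rgl::plot3d(e, col = ellipse_cols[i], alpha = 0.2, add = TRUE, box = FALSE)
  }
}
```

Using split() keeps the code correct even if the rows of the data are not sorted by species.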


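Summary

This class introduced a new statistical tool: the ROC curve. What it does can in fact also be done with a loop plus logistic regression, but the visual presentation lets the audience grasp the information faster, and that is the appeal of data visualization.
In addition, the most important thing learned in this class is how to use Google to find code similar to what you want to do, and how to apply that code to your own data.

– If you can find it on Google, you can even use the package rgl to make an animated gif. (Note: you do not have to do everything in R; any approach that works is a good one!)

[Figure F9_7]

Later data visualization classes will show how to use several well-known packages to draw more interactive graphics!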

Data Visualization 2
